Streaming Large Data to/from Pickle Files in Python

Yes, you can process pickle files in a streaming fashion to handle very large datasets without loading everything into memory at once. Here's how to do it:

Streaming Write to Pickle File

import pickle

def stream_write_to_pickle(data_iterable, filename):
    """Write data to pickle file in streaming fashion"""
    with open(filename, 'wb') as f:
        for item in data_iterable:
            pickle.dump(item, f)
            # Drop the reference so very large items can be freed before the next one arrives
            del item

# Example usage:
def generate_large_data():
    """Generator that yields dictionaries one by one"""
    for i in range(1000000):  # 1 million items
        yield {'id': i, 'data': 'x' * 100}  # Example large dictionary

# Write in streaming fashion
stream_write_to_pickle(generate_large_data(), 'large_data.pkl')

Streaming Read from Pickle File

import pickle

def stream_read_from_pickle(filename):
    """Generator that reads pickle file item by item"""
    with open(filename, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

# Example usage:
# Process one item at a time without loading all into memory
for item in stream_read_from_pickle('large_data.pkl'):
    # Process your item here
    print(f"Processing item {item['id']}")
    # Item will be garbage collected after processing

Advanced Version with Batch Processing

For better performance with very large files, you can process in batches:

def batch_read_from_pickle(filename, batch_size=1000):
    """Read pickle file in batches"""
    batch = []
    with open(filename, 'rb') as f:
        while True:
            try:
                batch.append(pickle.load(f))
                if len(batch) >= batch_size:
                    yield batch
                    batch = []
            except EOFError:
                if batch:  # yield remaining items
                    yield batch
                break

# Example usage:
for batch in batch_read_from_pickle('large_data.pkl', batch_size=500):
    print(f"Processing batch of {len(batch)} items")
    # Process the batch
    del batch  # free memory explicitly

Important Notes

  1. File Structure: This creates a pickle file containing multiple pickled objects in sequence (not a single list); the short sketch after this list shows the difference.

  2. Performance: Streaming is slower than bulk operations but uses much less memory.

  3. Compatibility: All items must be pickle-able (most Python objects are).

  4. Error Handling: The EOFError catch is how we detect the end of file.

  5. Memory Management: Explicit del statements help with memory management for very large items.
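
To make Note 1 concrete, here is a minimal sketch of the difference between the two layouts (the file names are just illustrative): a file holding one pickled list gives everything back in a single pickle.load call, while a file written with repeated pickle.dump calls gives back one object per load.

import pickle

# Layout A: one pickled list -- a single load returns the whole list at once
with open('as_one_list.pkl', 'wb') as f:
    pickle.dump([{'id': 0}, {'id': 1}], f)
with open('as_one_list.pkl', 'rb') as f:
    print(pickle.load(f))   # [{'id': 0}, {'id': 1}]

# Layout B: objects dumped in sequence -- each load returns only the next object
with open('as_stream.pkl', 'wb') as f:
    pickle.dump({'id': 0}, f)
    pickle.dump({'id': 1}, f)
with open('as_stream.pkl', 'rb') as f:
    print(pickle.load(f))   # {'id': 0}
    print(pickle.load(f))   # {'id': 1}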

This approach lets you handle datasets much larger than your available RAM since you only keep one item (or a small batch) in memory at a time.
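
As a rough, self-contained sketch of that workflow, the code below chains a streaming read to a streaming write so a file can be transformed without ever holding more than one item in memory. The names stream_transform, transform, and the output filename are illustrative, not part of the recipes above.

import pickle

def transform(item):
    """Placeholder for whatever per-item processing you need."""
    item['data'] = item['data'].upper()
    return item

def stream_transform(in_filename, out_filename):
    """Read one pickled item at a time, process it, and write it back out."""
    with open(in_filename, 'rb') as src, open(out_filename, 'wb') as dst:
        while True:
            try:
                item = pickle.load(src)   # only this one item is in memory
            except EOFError:
                break
            pickle.dump(transform(item), dst)

# Example: rewrite 'large_data.pkl' (written above) item by item
stream_transform('large_data.pkl', 'large_data_transformed.pkl')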

Batch Processing with Pickle (5 Dictionaries at a Time)

Here's how to write and read a list of dictionaries in batches of 5 using Python's pickle module:

Writing in Batches (5 at a time)

import pickle

def write_batches_to_pickle(data_list, filename, batch_size=5):
    """
    Write list of dictionaries to pickle file in batches
    :param data_list: Your list of dictionaries
    :param filename: Output pickle file
    :param batch_size: Number of items per batch (default 5)
    """
    with open(filename, 'wb') as f:
        for i in range(0, len(data_list), batch_size):
            batch = data_list[i:i + batch_size]
            pickle.dump(batch, f)

    # Clear the original list if needed
    data_list.clear()

# Example usage:
large_list = [{'id': x, 'data': f'sample{x}'} for x in range(100000)]  # 100k items
write_batches_to_pickle(large_list, 'batched_data.pkl')

Reading in Batches (5 at a time)

def read_batches_from_pickle(filename):
    """
    Generator that yields batches of dictionaries from pickle file
    :param filename: Pickle file containing batched data
    :yields: Lists of dictionaries (batches of 5)
    """
    with open(filename, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

# Example usage:
for batch in read_batches_from_pickle('batched_data.pkl'):
    print(f"Processing batch of {len(batch)} items:")
    for item in batch:
        print(f" - Item ID: {item['id']}")
    # Batch will be garbage collected after processing

Complete Workflow Example

# Create sample data (100k dictionaries)
data = [{'id': i, 'value': i*2} for i in range(100000)]

# Write in batches of 5
write_batches_to_pickle(data, 'big_data.pkl')

# Verify original list is cleared
print(f"Original list length after writing: {len(data)}") # Should be 0

# Process data in batches later
total_items = 0
for batch in read_batches_from_pickle('big_data.pkl'):
    print(f"Processing batch (size: {len(batch)})")
    total_items += len(batch)
    # Your processing code here

print(f"Total items processed: {total_items}") # Should be 100000

Key Advantages

  1. Memory Efficiency: Only one batch (at most 5 dictionaries here) is held in memory at a time during processing
  2. Simple Implementation: Uses only the standard library (pickle)
  3. Flexible Batch Size: Easily adjustable via the batch_size parameter
  4. Preserved Structure: Items come back in their original order, so the flat list can be rebuilt from the batches (see the sketch below)
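
If the batch boundaries themselves don't matter downstream, the batches can be flattened back into a stream of individual dictionaries with itertools; this is a small sketch built on the read_batches_from_pickle generator above, and the list() step assumes the restored data fits in memory.

from itertools import chain

# Iterate over individual dictionaries, ignoring batch boundaries
for item in chain.from_iterable(read_batches_from_pickle('batched_data.pkl')):
    pass  # process each dictionary here

# Or, if it fits in memory, rebuild the original flat list
restored = list(chain.from_iterable(read_batches_from_pickle('batched_data.pkl')))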

This approach gives you fine-grained control over memory usage while keeping the implementation simple and dependency-free.